Unsupervised Learning for Stock Clustering (PCA, K-Means, HCA)¶

Project combines the following main steps:

  • Wikipedia web-scraping to collect S&P500 companies' tickers, names and sector information
  • yfinance stock data extraction to collect daily prices
  • Returns computation for the desired frequency
  • Random sampling of stocks to select the desired subset size
  • Principal Component Analysis (PCA)
  • K-Means Clustering
  • Hierarchical Clustering Analysis (HCA)
  • Correlation Matrices comparison
  • Cluster composition visualization
  • Assessment of clustering methods
  • Appendix: silhouette and elbow methods to find optimal k

Imports¶

In [1]:
import pandas as pd
import numpy as np
import random
import yfinance as yf

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly
plotly.offline.init_notebook_mode(connected=True)

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import fcluster,linkage, dendrogram, leaves_list

Data Preparation¶

Loading tickers¶

In [2]:
### GET S&P 500 constituents from Wikipedia ###
def get_sp500_cons():
    '''
    Scrape S&P 500 constituents from Wikipedia.
    Returns a df indexed by ticker with security names, GICS sectors, and sub-industries
    '''
    URL = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    df = pd.read_html(URL)[0]
    df['Symbol'] = df['Symbol'].str.replace('.','-',regex=False) # literal replace (yfinance uses '-' in tickers)
    df = df.drop(['Headquarters Location','Date added','CIK','Founded'],axis=1)
    df = df.sort_values(by=['GICS Sector','GICS Sub-Industry'])
    df = df.set_index('Symbol')
    df.dropna(inplace=True)
    return df
In [3]:
# Ticker scraping
data = get_sp500_cons()
data.head()
Out[3]:
Security GICS Sector GICS Sub-Industry
Symbol
IPG Interpublic Group of Companies (The) Communication Services Advertising
OMC Omnicom Group Communication Services Advertising
WBD Warner Bros. Discovery Communication Services Broadcasting
CHTR Charter Communications Communication Services Cable & Satellite
CMCSA Comcast Communication Services Cable & Satellite

Collecting Prices¶

In [4]:
### GET prices from yfinance ###
def get_prices(tickers,start='1995-12-31'):
    '''
    Download daily prices from yfinance for a list of tickers.
    Returns a df of daily prices
    '''
    prices = yf.download(tickers, start=start, interval='1d')
    prices = prices['Adj Close']
    # fwd fill last prices for missing daily prices
    prices = prices.asfreq('D').ffill()
    return prices
In [5]:
# Collect prices from yfinance into stock_data df
ticker_list = data.index.to_list()
start_date = '2015-01-01'
stock_data = get_prices(tickers=ticker_list,start=start_date)
[*********************100%***********************]  503 of 503 completed
In [6]:
stock_data.head()
Out[6]:
A AAL AAP AAPL ABBV ABC ABT ACGL ACN ADBE ... WYNN XEL XOM XRAY XYL YUM ZBH ZBRA ZION ZTS
Date
2014-12-31 38.084969 50.814606 147.124634 24.767366 45.798306 78.874535 38.343548 19.700001 77.527657 72.699997 ... 130.989746 27.941647 62.913483 49.859367 34.242157 44.820728 103.148308 77.410004 23.585787 40.600727
2015-01-01 38.084969 50.814606 147.124634 24.767366 45.798306 78.874535 38.343548 19.700001 77.527657 72.699997 ... 130.989746 27.941647 62.913483 49.859367 34.242157 44.820728 103.148308 77.410004 23.585787 40.600727
2015-01-02 37.823860 51.079906 146.459518 24.531763 46.113235 79.136978 38.241344 19.496668 77.119682 72.339996 ... 129.343124 28.097225 63.172108 48.605175 34.251160 44.513115 102.393486 77.430000 23.403790 40.864925
2015-01-03 37.823860 51.079906 146.459518 24.531763 46.113235 79.136978 38.241344 19.496668 77.119682 72.339996 ... 129.343124 28.097225 63.172108 48.605175 34.251160 44.513115 102.393486 77.430000 23.403790 40.864925
2015-01-04 37.823860 51.079906 146.459518 24.531763 46.113235 79.136978 38.241344 19.496668 77.119682 72.339996 ... 129.343124 28.097225 63.172108 48.605175 34.251160 44.513115 102.393486 77.430000 23.403790 40.864925

5 rows × 503 columns

Computing Returns¶

In [7]:
### COMPUTE Returns on specified frequency ###
def get_returns(prices_df,freq='D'):
    '''
    Input a df of daily prices. Use freq 'D', 'B', 'W', or 'M' for daily, business-daily, weekly, or monthly returns.
    Output a returns df at the selected frequency
    '''
    prices_d = prices_df.dropna(axis=0,how='all')
    prices_d = prices_d.dropna(axis=1)
    prices_res = prices_d.resample(freq).last()
    prices_res = prices_res.dropna(axis=1)
    returns = prices_res.pct_change().dropna()
    return returns
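As a quick sanity check of the resample-then-`pct_change` logic above, here is a minimal, self-contained sketch on synthetic daily prices (the tickers and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic daily prices for two hypothetical tickers (linear trends)
idx = pd.date_range('2023-01-01', periods=28, freq='D')
prices = pd.DataFrame({'AAA': np.linspace(100, 127, 28),
                       'BBB': np.linspace(50, 55.4, 28)}, index=idx)

# Same logic as get_returns: take the last price of each week, then pct_change
weekly = prices.resample('W').last().pct_change().dropna()
print(weekly.shape)                 # one row per completed week
print(round(weekly.iloc[0, 0], 4))  # AAA gains 1/day: 107/100 - 1 = 0.07
```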
In [8]:
# Set Frequency of returns
freq = 'W'
# Compute returns
returns = get_returns(stock_data,freq=freq)
returns.head(3)
Out[8]:
A AAL AAP AAPL ABBV ABC ABT ACGL ACN ADBE ... WYNN XEL XOM XRAY XYL YUM ZBH ZBRA ZION ZTS
Date
2015-01-11 0.000740 -0.035058 0.010974 0.024513 -0.001669 0.028079 0.006681 0.010600 0.010581 -0.006912 ... 0.014705 0.001661 -0.007864 0.015983 -0.071166 0.015342 0.049915 0.040165 -0.078827 0.021704
2015-01-18 -0.057650 -0.042484 -0.064254 -0.053745 -0.011485 -0.006237 -0.010498 0.002368 -0.009913 -0.001531 ... -0.013821 0.023217 -0.010640 -0.037907 -0.025445 -0.008422 -0.010236 0.029426 -0.049117 -0.000678
2015-01-25 0.014641 0.118049 0.038534 0.065950 -0.032693 0.024995 -0.014161 0.002700 0.003713 0.032483 ... -0.006667 0.019990 -0.002525 0.002561 0.020308 0.023195 -0.000171 0.015438 0.001211 -0.000385

3 rows × 479 columns

Random Sample¶

In [66]:
### SAMPLE N stocks => returns_df ###
def get_sample(returns,N=None):
    '''
    Sample N stocks randomly from the columns of the returns df (all stocks if N is None).
    Returns a returns df of the randomly sampled stocks
    '''
    if N is None:
        N = returns.shape[1]
    sample = random.sample(returns.columns.to_list(),N)
    return returns[sample]
In [67]:
# Sample N stocks
returns_df = get_sample(returns,N=350)
returns_df.head(3)
Out[67]:
PEG XYL HSY ILMN VZ ADSK PAYX ISRG WFC NTAP ... BWA PCG PNC SNA CPRT AMP XOM CAG PEP SHW
Date
2015-01-11 -0.013604 -0.071166 0.026034 0.050588 0.007521 -0.021166 0.019464 -0.008943 -0.036928 -0.015444 ... -0.022287 0.046682 -0.044593 -0.004109 -0.012575 -0.033386 -0.007864 0.011725 0.025201 0.045649
2015-01-18 0.053714 -0.025445 0.039940 -0.056365 0.026305 -0.024369 0.003394 0.018738 -0.019742 -0.046740 ... -0.050822 0.041734 -0.021401 -0.023355 -0.030731 -0.023653 -0.010640 -0.000276 0.004854 -0.002187
2015-01-25 0.010333 0.020308 -0.018435 0.072571 -0.017504 0.023571 0.022833 -0.024612 0.031371 0.018839 ... 0.078346 0.012208 0.021791 0.013428 0.053985 0.038665 -0.002525 0.014628 0.012745 0.000694

3 rows × 350 columns

Principal Component Analysis (PCA)¶

In [68]:
### RUN PCA ###
def train_PCA(returns, n_comp=20):
    """
    From returns df, compute n_comp PCA and returns W,pca,X_proj,cum_var
    """
    # Set X
    X = returns
    # Standardize returns into X_scal
    scaler = StandardScaler()
    scaler.fit(X)
    X_scal = pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index) 
    # Run PCA
    pca = PCA(n_comp)
    pca.fit(X_scal)
    # Get PCA loadings
    W = pca.components_
    W = pd.DataFrame(W.T,
                     index=X.columns,
                     columns=pca.get_feature_names_out())
    # Print cum explained variance by n_comp components
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    print(f'Total explained variance:{np.round(cum_var.max(),2)} with {n_comp} PCs')
    
    X_proj = pca.transform(X_scal)
    X_proj = pd.DataFrame(X_proj, columns=pca.get_feature_names_out(),index=X_scal.index)
    
    return W,pca,X_proj,cum_var
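To illustrate what `train_PCA` computes, a self-contained sketch on random standardized data (shapes and seed are arbitrary): the loadings matrix has one row per stock, and the cumulative explained variance is non-decreasing and bounded by 1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))             # 200 observations, 30 "stocks"
X_scal = StandardScaler().fit_transform(X)

pca = PCA(10).fit(X_scal)
W_demo = pca.components_.T                 # loadings: one row per "stock"
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(W_demo.shape)                        # (30, 10)
```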
In [69]:
# Run PCA on returns df 
W,pca,X_proj,cum_var = train_PCA(returns_df,n_comp=20)

# Plot total cumulative explained variance using n_comp PCs
fig = px.bar(cum_var,
             width=800,
             height=600,
             title='Cumulative Variance explained by PCs',
            )
fig.update_layout(xaxis_title='PCs', yaxis_title='Cum Var', showlegend=False)
fig.show()
Total explained variance:0.67 with 20 PCs
In [70]:
# Quick Viz of PC orthogonality
plt.figure(figsize=(12,8))
sns.heatmap(X_proj.corr(), cmap='coolwarm')
plt.title('Correlation Heatmap - Principal Components')
plt.show()
In [71]:
# Plot the first two PCs (pca0 vs pca1) - color by GICS Sector
df = W.join(data[['Security','GICS Sector']])
fig = px.scatter(df, x='pca0', y='pca1',
                    title='PCA - PC1 vs PC2',
                    color='GICS Sector',
                 width=1000,
                 height=700,
                    hover_data=[df.index,df.Security],
               )

fig.show()
In [72]:
# Plot top 3 PCs
fig = px.scatter_3d(df, x='pca0', y='pca1', z='pca2',
                    title='PCA - Top 3 PCs',
                    width=1000,
                    height=700,
                    color='GICS Sector',
                    hover_data=[df.index,df.Security],
                   )

fig.show()

K-Means Clustering¶

In [73]:
### GET K-Means clusters labels based on PCA ###
def get_kmeans_clusters(W,k=11):
    """
    From W matrix of PCA loadings for each stock, 
    train K-Means to cluster stocks in k clusters.
    Returns clusters labels assigned to each stock in a df
    """
    kmeans = KMeans(n_clusters=k, n_init=10) # n_init set explicitly for consistency across sklearn versions
    kmeans.fit(W)
    labels = kmeans.labels_
    # Assign stocks to clusters
    clusters_k = pd.DataFrame(labels,index=W.index,columns=['cluster'])
    clusters_k = clusters_k.sort_values(by='cluster')
    return clusters_k
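A minimal, self-contained illustration of the same pattern (the synthetic `W_toy` below stands in for the PCA `W` matrix; two obvious blobs of loadings should land in two clusters):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two well-separated blobs of "loadings" (10 stocks each, 3 components)
W_toy = pd.DataFrame(np.vstack([rng.normal(0, 0.1, size=(10, 3)),
                                rng.normal(1, 0.1, size=(10, 3))]),
                     index=[f's{i}' for i in range(20)],
                     columns=['pc0', 'pc1', 'pc2'])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(W_toy)
clusters_toy = pd.DataFrame(kmeans.labels_, index=W_toy.index, columns=['cluster'])
print(clusters_toy['cluster'].value_counts().to_dict())  # two clusters of 10
```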

Run K-Means¶

In [74]:
# Set number of clusters
k = 11 # in-line with GICS Sector classification

# PCA + K-Means cluster labels
clusters_k = get_kmeans_clusters(W,k=k)

HCA - Hierarchical Cluster Analysis¶

In [75]:
### RUN HCA ###
def get_hca_clusters(X,k=returns_df.shape[1]//10):
    """
    From X returns_df compute Z linkage matrix on C cov Matrix, 
    use fcluster to split linkage matrix into k clusters
    Returns k clusters labels assigned to each stock in a df
    """
    # Build Covariance Matrix C
    C = X.cov()

    # Compute the linkage matrix using Ward's method
    Z = linkage(C, method='ward', metric='euclidean')

    # Use the fcluster function to split the dendrogram into k clusters
    labels = fcluster(Z, k, criterion='maxclust')

    # Assign cluster labels to stocks 
    clusters_h = pd.DataFrame(labels,index=X.columns,columns=['cluster'])
    clusters_h = clusters_h.sort_values(by='cluster')
    return clusters_h, Z
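The linkage/fcluster mechanics can be checked on a toy covariance matrix with two obvious blocks (all names and sizes below are illustrative, not the notebook's data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Ten synthetic return series in two correlated groups of five
base1, base2 = rng.normal(size=(2, 500))
series = [base1 + 0.3 * rng.normal(size=500) for _ in range(5)] + \
         [base2 + 0.3 * rng.normal(size=500) for _ in range(5)]
C = np.cov(np.column_stack(series), rowvar=False)  # 10x10 block-structured cov

Z = linkage(C, method='ward', metric='euclidean')
labels = fcluster(Z, 2, criterion='maxclust')
print(labels)  # first five share one label, last five the other
```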
In [76]:
# Set number of clusters
k = 11 # in-line with GICS Sector classification

# Get clusters and linkage matrix Z
clusters_h, Z = get_hca_clusters(returns_df,k=k)
In [77]:
# Plot dendrogram
plt.figure(figsize=(18,8))
dendrogram(Z)
plt.title(f'Dendrogram of {returns_df.shape[1]} sampled stocks')
plt.xlabel('Stocks')
plt.ylabel('Distance')
plt.show()

Correlation Matrices - Clustering methods comparison¶

In [78]:
# Sort stocks based on HCA leaf order (dendrogram level 0)
stocks_names_HCA_sorted = returns_df.columns[leaves_list(Z)]

# Sort stocks by GICS Sectors
GICS_sorted = returns_df.T.join(data['GICS Sector']).sort_values('GICS Sector')
In [79]:
# Plot Correlation Heatmaps
X = returns_df.copy()

fig = make_subplots(rows=2, cols=2, subplot_titles=('Unclustered Stocks',
                                                    'GICS Sectors',
                                                    'PCA & K-Means Clustering',
                                                    'Hierarchical Clustering',
                   ))

heatmap1 = px.imshow(X.corr())
heatmap2 = px.imshow(X[GICS_sorted.index].corr())
heatmap3 = px.imshow(X[clusters_k.index].corr())
heatmap4 = px.imshow(X[stocks_names_HCA_sorted].corr())


fig.add_trace(heatmap1.data[0], row=1, col=1)
fig.add_trace(heatmap2.data[0], row=1, col=2)
fig.add_trace(heatmap3.data[0], row=2, col=1)
fig.add_trace(heatmap4.data[0], row=2, col=2)

fig.update_layout(height=1000, 
                  width=900, 
                  showlegend=True,
                  coloraxis={'colorscale': 'viridis'},
                 title='Correlation Heatmaps Comparison')

fig.show()

Clustering details by stock & sector / within & between correlation assessment¶

In [80]:
# PCA-K-Means clustering details
clusters = clusters_k

df = clusters.join(data)
KMEAN_ret = returns_df.T.join(df).groupby('cluster').mean().T
grouped = df.groupby(['cluster', 'GICS Sector', 'Security']).size().reset_index(name='count')

fig = px.bar(grouped, x='cluster', y='count', color='GICS Sector', title='PCA & K-Means clustering',
             barmode='stack',hover_data=['Security'],text='Security',
            height=800,
            width=1400)

fig.update_layout(xaxis_title='Cluster', yaxis_title='Security count', hovermode='closest')
fig.show()
In [81]:
# Within-cluster correlation
df = clusters_k.join(data)
within_kmean_corr = []
for c in set(df.cluster.values):
    temp = returns_df.T.join(df)
    temp = temp[temp.cluster == c].T
    temp = temp[:-4] # drop the 4 joined metadata rows (cluster, Security, sector, sub-industry)
    temp = temp.astype(float)
    corr_mat = temp.corr()
    upper = np.triu(corr_mat,k=1)
    corr_mean = np.nanmean(upper[upper!=0])
    if not np.isnan(corr_mean):
        within_kmean_corr.append(corr_mean)
within_corr_mean = np.mean(within_kmean_corr)

# Between-cluster correlation
corr_mat = KMEAN_ret.corr()
upper = np.triu(corr_mat,k=1)
between_corr_mean = np.mean(upper[upper!=0])

corr_comp_df = pd.DataFrame([within_corr_mean,between_corr_mean],index=['Within Corr','Between Corr'],columns=['PCA-KMEANS'])
corr_comp_df
Out[81]:
PCA-KMEANS
Within Corr 0.525310
Between Corr 0.656626
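One subtlety in the averaging above: `np.triu(corr_mat, k=1)` keeps the strict upper triangle, and the `upper[upper != 0]` filter then selects the pairwise correlations, though it would also drop any correlation that happens to be exactly zero (rare with real returns). A toy version of the computation:

```python
import numpy as np
import pandas as pd

# Toy 3x3 correlation matrix
corr_mat = pd.DataFrame([[1.0, 0.8, 0.2],
                         [0.8, 1.0, 0.4],
                         [0.2, 0.4, 1.0]])

# Strict upper triangle zeroes out the diagonal and the lower half
upper = np.triu(corr_mat, k=1)
pair_corrs = upper[upper != 0]      # the 3 distinct pairwise correlations
print(round(pair_corrs.mean(), 4))  # (0.8 + 0.2 + 0.4) / 3
```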
In [82]:
# HCA Clustering details
clusters = clusters_h

df = clusters.join(data)
HCA_ret = returns_df.T.join(df).groupby('cluster').mean().T
grouped = df.groupby(['cluster', 'GICS Sector', 'Security']).size().reset_index(name='count')

fig = px.bar(grouped, x='cluster', y='count', color='GICS Sector', title='Hierarchical clustering',
             barmode='stack',hover_data=['Security'],text='Security',
            height=800,
            width=1400)

fig.update_layout(xaxis_title='Cluster', yaxis_title='Security count', hovermode='closest')
fig.show()
In [83]:
# Within-cluster correlation
df = clusters_h.join(data)
within_hca_corr = []
for c in set(df.cluster.values):
    temp = returns_df.T.join(df)
    temp = temp[temp.cluster == c].T
    temp = temp[:-4]
    temp = temp.astype(float)
    corr_mat = temp.corr()
    upper = np.triu(corr_mat,k=1)
    corr_mean = np.nanmean(upper[upper!=0])
    if not np.isnan(corr_mean):
        within_hca_corr.append(corr_mean)
within_corr_mean = np.mean(within_hca_corr)

# Between-cluster correlation
corr_mat = HCA_ret.corr()
upper = np.triu(corr_mat,k=1)
between_corr_mean = np.mean(upper[upper!=0])

corr_comp_df['HCA'] = [within_corr_mean,between_corr_mean]
corr_comp_df
Out[83]:
PCA-KMEANS HCA
Within Corr 0.525310 0.519254
Between Corr 0.656626 0.590020
In [84]:
# GICS Clustering details
clusters = returns_df.T

df = clusters.join(data)
GICS_ret = df.groupby('GICS Sector').mean().T
grouped = df.groupby(['GICS Sector', 'Security']).size().reset_index(name='count')

fig = px.bar(grouped, x='GICS Sector', y='count', color='GICS Sector', title='GICS Sectors clustering',
             barmode='stack',hover_data=['Security'],text='Security',
            height=800,
            width=1400)

fig.update_layout(xaxis_title='Cluster', yaxis_title='Security count', hovermode='closest')
fig.show()
In [85]:
# Within-cluster correlation
df = returns_df.T.join(data)
within_GICS_corr = []
sector = np.unique(df['GICS Sector'].values)
for c in sector:
    temp = df.copy()
    temp = temp[temp['GICS Sector'] == c].T
    temp = temp[:-4]
    temp = temp.astype(float)
    corr_mat = temp.corr()
    upper = np.triu(corr_mat,k=1)
    corr_mean = np.nanmean(upper[upper!=0])
    if not np.isnan(corr_mean):
        within_GICS_corr.append(corr_mean)
within_corr_mean = np.mean(within_GICS_corr)

# Between-cluster correlation
corr_mat = GICS_ret.corr()
upper = np.triu(corr_mat,k=1)
between_corr_mean = np.mean(upper[upper!=0])

corr_comp_df['GICS'] = [within_corr_mean,between_corr_mean]
corr_comp_df
Out[85]:
PCA-KMEANS HCA GICS
Within Corr 0.525310 0.519254 0.506066
Between Corr 0.656626 0.590020 0.687749

APPENDIX: Optimal k number of clusters (silhouette, within cluster distance)¶

In [86]:
low = 2
high = 50
step = 1

# Calculate silhouette scores for different values of k
from sklearn.metrics import silhouette_score
silhouette_scores = []
sse=[]
W,_,_,_= train_PCA(returns_df,n_comp=20)


for i in range(low,high,step):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(W)
    labels = kmeans.labels_
    silhouette_avg = silhouette_score(returns_df.T,labels)
    silhouette_scores.append(silhouette_avg) # silhouette
    sse.append(kmeans.inertia_) # within-cluster sse
Total explained variance:0.67 with 20 PCs
In [87]:
# Plot the silhouette scores
plt.plot(range(low,high,step),silhouette_scores,marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()
In [88]:
# Plot within cluster sum of squares
plt.plot(range(low,high,step),sse,marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sse')
plt.show()

Thanks for reading!¶

About me

  • https://www.linkedin.com/in/gribiere/
  • https://github.com/GuillaumeRib
  • https://troopl.com/gribiere